Adding MultiModal ShardedDataloader #262
sub_shards.extend(shards[i])
if len(sub_shards) == num_uri_merge:
If we are extending `sub_shards` by a list, isn't it possible that the size of `sub_shards` exceeds `num_uri_merge`, so the condition will be false and we continue to expand `sub_shards`?
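To make the concern concrete, here is a minimal sketch of the overshoot and one possible guard. The function names, the trailing flush of leftover shards, and the `ValueError` check are assumptions made for illustration; the loop surrounding the two reviewed lines is not shown in this PR excerpt, so this may not match the actual implementation.

```python
from typing import List


def merge_shards_eq(shards: List[List[str]], num_uri_merge: int) -> List[List[str]]:
    """Mirrors the reviewed pattern: flush only when the size is exactly equal."""
    merged: List[List[str]] = []
    sub_shards: List[str] = []
    for group in shards:
        sub_shards.extend(group)
        # If extend() jumps past num_uri_merge, this is never true again.
        if len(sub_shards) == num_uri_merge:
            merged.append(sub_shards)
            sub_shards = []
    if sub_shards:  # assumed: leftovers are emitted as a final group
        merged.append(sub_shards)
    return merged


def merge_shards_chunked(shards: List[List[str]], num_uri_merge: int) -> List[List[str]]:
    """One possible guard: flush fixed-size chunks whenever enough have accumulated."""
    if num_uri_merge < 1:
        raise ValueError("num_uri_merge must be >= 1 for chunked merging")
    merged: List[List[str]] = []
    sub_shards: List[str] = []
    for group in shards:
        sub_shards.extend(group)
        while len(sub_shards) >= num_uri_merge:
            merged.append(sub_shards[:num_uri_merge])
            sub_shards = sub_shards[num_uri_merge:]
    if sub_shards:
        merged.append(sub_shards)
    return merged


example = [["a", "b"], ["c", "d"], ["e"]]
overshoot = merge_shards_eq(example, 3)      # size goes 2 -> 4 -> 5, never exactly 3
chunked = merge_shards_chunked(example, 3)   # groups are capped at 3 entries
```

With the equality check, the example input ends up as a single group of five shards. The chunked variant caps every group at `num_uri_merge` entries, at the cost of splitting an input list across two groups.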
class S3ShardSampler(ShardSampler, pl.core.hooks.CheckpointHooks):
    def __init__(self, uri: str, glob: Optional[str] = None, recursive: bool = True, num_uri_merge: int = 0):
        s3_client = self._get_client()
        self.shards: List[str] = list(get_objects_from_uris(uri, s3_client) # type: ignore
Isn't it missing a closing parenthesis?
Actually, what is the idea behind that line? `uri` represents one object, so `get_objects_from_uris` will return one `S3BucketKeyData` that holds the passed `uri` parsed into `bucket` and `key`. As I understand it, `self.shards` is not expected to hold `S3BucketKeyData`.
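The type mismatch described above can be illustrated with a small stand-in. `BucketKey` and `parse_s3_uri` are hypothetical names invented for this sketch. They are not the library's `S3BucketKeyData` or `get_objects_from_uris`, whose exact behavior is not shown in this excerpt. The URI is also made up.

```python
from typing import List, NamedTuple
from urllib.parse import urlparse


class BucketKey(NamedTuple):
    """Hypothetical stand-in for a parsed S3 location (bucket plus key)."""
    bucket: str
    key: str


def parse_s3_uri(uri: str) -> BucketKey:
    """Split a single s3:// URI into its bucket and key parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return BucketKey(bucket=parsed.netloc, key=parsed.path.lstrip("/"))


# One URI in, one parsed record out: no listing of objects takes place.
record = parse_s3_uri("s3://example-bucket/train/shard-0000.tar")

# A field annotated List[str] would need URI strings, not parsed records,
# so a conversion step like this one would be required somewhere.
shards: List[str] = [f"s3://{record.bucket}/{record.key}"]
```

If the sampler is meant to enumerate every shard under a prefix, a listing call would be needed rather than a parse of the single `uri`. Which of the two the author intended is the open question in this review thread.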
Description
Adding a MultiModal Sharded Dataloader and several of the samplers to be used
Additional context
ShardedDataloader is a flexible multimodal dataloader that attempts to solve the use case of large-scale LLM/VLM training. A single backed instance of the S3 dataloader is not sufficient, because it faces the following issues, which this dataloader attempts to address:
Related items
Testing
By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.